title: Red Wine Analysis by R
author: Kholood Alsaggaf

Abstract: An analysis of the Red Wine dataset was conducted to understand which variables are responsible for the quality of the wine, by finding the correlations among the variables and between each variable and wine quality. In conclusion, a linear model is used to predict the outcome for a test set.

========================================================

Structure and summary of Dataframe

About the data: This tidy data set contains 1,599 red wines with 11 variables on the chemical properties of the wine.

## 'data.frame':    1599 obs. of  14 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : Ord.factor w/ 6 levels "3"<"4"<"5"<"6"<..: 3 3 3 4 3 3 3 5 5 3 ...
##  $ rating              : Ord.factor w/ 3 levels "bad"<"average"<..: 2 2 2 2 2 2 2 3 3 2 ...
##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol      quality     rating    
##  Min.   : 8.40   3: 10   bad    :  63  
##  1st Qu.: 9.50   4: 53   average:1319  
##  Median :10.20   5:681   good   : 217  
##  Mean   :10.42   6:638                 
##  3rd Qu.:11.10   7:199                 
##  Max.   :14.90   8: 18

Univariate Plots Section

First, plot the distribution of each variable to get an idea of the data and observe the shape of each distribution. Then remove the extreme outliers to get a clearer analysis.
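One way to exclude extreme outliers before plotting is to keep only the values inside a chosen quantile range. The sketch below is an assumption about how such trimming could be done, not the exact rule used for the plots in this report; the helper name `trim_outliers`, the cutoffs and the toy vector are all hypothetical.

```r
# Hypothetical helper: keep values between two quantiles (cutoffs are assumed).
trim_outliers <- function(x, lower = 0.01, upper = 0.99) {
  bounds <- quantile(x, probs = c(lower, upper), na.rm = TRUE)
  x[!is.na(x) & x >= bounds[1] & x <= bounds[2]]
}

# Toy vector with one extreme value, standing in for a skewed wine variable.
sugar <- c(1.2, 1.6, 1.8, 1.9, 1.9, 2.0, 2.2, 2.3, 2.6, 15.5)

# Drop everything above the 90th percentile of the toy data.
trimmed <- trim_outliers(sugar, lower = 0, upper = 0.9)
print(range(trimmed))
```

Trimming by quantile keeps the cutoff relative to each variable's own scale, which matters here because the variables range from about 0.01 (chlorides) to 289 (total sulfur dioxide).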

Observations: The Fixed Acidity distribution is positively skewed, with a median around 8 and most wines concentrated near that value. The plot has been modified to exclude extreme outliers.

Observations: The Volatile Acidity distribution is bimodal, with two peaks at 0.4 and 0.6.

Observations: Citric acid has no clear visual distribution; something appears to be wrong with the data.

Observations: The Residual Sugar distribution is positively skewed, with a high peak at around 2 and many outliers in the higher ranges.

Observations: The Chlorides distribution is positively skewed. The plot has been modified to exclude extreme outliers.

Observations: The Free Sulphur Dioxide distribution is positively skewed. There is a high peak at 7, after which the same positively skewed pattern continues, with outliers in the high range.

Observations: The Total Sulphur Dioxide distribution is also positively skewed.

Observations: Density has a normal distribution.

Observations: pH is also normally distributed.

Observations: The Sulphates distribution is also positively skewed, with a few outliers.

Observations: Alcohol has a somewhat positively skewed distribution, but the skewness is less than for the variables above.

Observations: Most of the wines in the dataset are average-quality wines. We aren’t sure whether the data is accurate and complete, because good-quality and poor-quality wines are almost like outliers.

Univariate Analysis

Structure of the dataset.

The Red Wine dataset originally had 1599 rows and 13 columns; the number of columns became 14 after adding a new column called ‘rating’. ‘quality’ and the derived ‘rating’ are categorical variables (ordered factors), X is a row index, and the rest of the variables are numerical variables that reflect the physical and chemical properties of the wine. From what we have observed, most of the wines are of ‘average’ quality, with very few ‘bad’ and ‘good’ wines. The challenge is to build the right predictive model when there isn’t enough data for the good-quality and bad-quality wines.
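The imbalance between the rating classes can be quantified directly from the counts printed by `summary()`. A minimal sketch, using only those three counts (the variable names are illustrative):

```r
# Counts copied from the summary() output of the rating column.
rating_counts <- c(bad = 63, average = 1319, good = 217)

# Share of each rating class, as a percentage of all wines.
rating_share <- round(100 * rating_counts / sum(rating_counts), 1)
print(rating_share)
```

Roughly four out of five wines fall in the ‘average’ class, so any model fitted to this data sees comparatively few examples of the two extremes.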

Main feature of interest in the dataset.

The main feature of interest is ‘quality’; the goal is to investigate which factors determine the quality of a wine.

Other features in the dataset that will help the investigation into the feature of interest.

Acidity (fixed, volatile or citric) may change the quality of the wine depending on its level, and pH may also have some effect on quality. Residual sugar may affect quality as well, because sugar determines the sweetness of the wine and may affect its taste.

New variables created from existing variables in the dataset.

Quality was converted from an integer to an ordered factor, and a new column called ‘rating’ was then added based on ‘quality’.
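The conversion described above can be sketched as follows. The thresholds are an assumption inferred from the summary counts (quality 3 and 4 add up to the 63 ‘bad’ wines, 5 and 6 to the 1319 ‘average’ wines, 7 and 8 to the 217 ‘good’ wines); the toy vector `quality_int` is made up and stands in for the integer column read from the file.

```r
# Toy quality scores standing in for the original integer column.
quality_int <- c(3, 4, 5, 5, 6, 7, 8)

# Ordered factor with the six levels shown in the str() output.
quality <- factor(quality_int, levels = 3:8, ordered = TRUE)

# Assumed thresholds, inferred from the summary counts:
# 3-4 -> bad, 5-6 -> average, 7-8 -> good.
rating <- cut(quality_int,
              breaks = c(-Inf, 4, 6, Inf),
              labels = c("bad", "average", "good"),
              ordered_result = TRUE)

# as.numeric() on a factor returns level codes (1 to 6), not the scores;
# converting through character recovers the original 3 to 8 scale.
codes  <- as.numeric(quality)
scores <- as.numeric(as.character(quality))
print(data.frame(quality, rating, codes, scores))
```

The distinction between codes and scores matters for any model that uses `as.numeric(quality)` as its response: its fitted values are on the 1 to 6 code scale, two units below the quality score.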

Unusual distributions.

Citric acid has no clear visual distribution compared with the rest of the numeric variables; something appears to be wrong with the data, as if the data collection were incomplete.

Bivariate Plots Section

This is a correlation table of the dataset variables, used to see which variables may be correlated with each other.

## 
## ---------------------------------------------------------------------------
##           &nbsp;            fixed.acidity   volatile.acidity   citric.acid 
## -------------------------- --------------- ------------------ -------------
##     **fixed.acidity**             1             -0.2561        **0.6717**  
## 
##    **volatile.acidity**        -0.2561             1           **-0.5525** 
## 
##      **citric.acid**         **0.6717**       **-0.5525**           1      
## 
##     **residual.sugar**         0.1148           0.001918         0.1436    
## 
##       **chlorides**            0.09371           0.0613          0.2038    
## 
##  **free.sulfur.dioxide**       -0.1538          -0.0105         -0.06098   
## 
##  **total.sulfur.dioxide**      -0.1132          0.07647          0.03553   
## 
##        **density**            **0.668**         0.02203        **0.3649**  
## 
##           **pH**             **-0.683**          0.2349        **-0.5419** 
## 
##       **sulphates**             0.183            -0.261        **0.3128**  
## 
##        **alcohol**            -0.06167          -0.2023          0.1099    
## 
##        **quality**             0.1241         **-0.3906**        0.2264    
## ---------------------------------------------------------------------------
## 
## Table: Table continues below
## 
##  
## ------------------------------------------------------------------------------
##           &nbsp;            residual.sugar   chlorides    free.sulfur.dioxide 
## -------------------------- ---------------- ------------ ---------------------
##     **fixed.acidity**           0.1148        0.09371           -0.1538       
## 
##    **volatile.acidity**        0.001918        0.0613           -0.0105       
## 
##      **citric.acid**            0.1436         0.2038          -0.06098       
## 
##     **residual.sugar**            1           0.05561            0.187        
## 
##       **chlorides**            0.05561           1             0.005562       
## 
##  **free.sulfur.dioxide**        0.187         0.005562             1          
## 
##  **total.sulfur.dioxide**       0.203          0.0474         **0.6677**      
## 
##        **density**            **0.3553**       0.2006          -0.02195       
## 
##           **pH**               -0.08565        -0.265           0.07038       
## 
##       **sulphates**            0.005527      **0.3713**         0.05166       
## 
##        **alcohol**             0.04208        -0.2211          -0.06941       
## 
##        **quality**             0.01373        -0.1289          -0.05066       
## ------------------------------------------------------------------------------
## 
## Table: Table continues below
## 
##  
## -----------------------------------------------------------------------------
##           &nbsp;            total.sulfur.dioxide     density         pH      
## -------------------------- ---------------------- ------------- -------------
##     **fixed.acidity**             -0.1132           **0.668**    **-0.683**  
## 
##    **volatile.acidity**           0.07647            0.02203       0.2349    
## 
##      **citric.acid**              0.03553          **0.3649**    **-0.5419** 
## 
##     **residual.sugar**             0.203           **0.3553**     -0.08565   
## 
##       **chlorides**                0.0474            0.2006        -0.265    
## 
##  **free.sulfur.dioxide**         **0.6677**         -0.02195       0.07038   
## 
##  **total.sulfur.dioxide**            1               0.07127      -0.06649   
## 
##        **density**                0.07127               1        **-0.3417** 
## 
##           **pH**                  -0.06649         **-0.3417**        1      
## 
##       **sulphates**               0.04295            0.1485        -0.1966   
## 
##        **alcohol**                -0.2057          **-0.4962**     0.2056    
## 
##        **quality**                -0.1851            -0.1749      -0.05773   
## -----------------------------------------------------------------------------
## 
## Table: Table continues below
## 
##  
## -------------------------------------------------------------------
##           &nbsp;            sulphates      alcohol       quality   
## -------------------------- ------------ ------------- -------------
##     **fixed.acidity**         0.183       -0.06167       0.1241    
## 
##    **volatile.acidity**       -0.261       -0.2023     **-0.3906** 
## 
##      **citric.acid**        **0.3128**     0.1099        0.2264    
## 
##     **residual.sugar**       0.005527      0.04208       0.01373   
## 
##       **chlorides**         **0.3713**     -0.2211       -0.1289   
## 
##  **free.sulfur.dioxide**     0.05166      -0.06941      -0.05066   
## 
##  **total.sulfur.dioxide**    0.04295       -0.2057       -0.1851   
## 
##        **density**            0.1485     **-0.4962**     -0.1749   
## 
##           **pH**             -0.1966       0.2056       -0.05773   
## 
##       **sulphates**             1          0.09359       0.2514    
## 
##        **alcohol**           0.09359          1        **0.4762**  
## 
##        **quality**            0.2514     **0.4762**         1      
## -------------------------------------------------------------------
  1. Quality is most strongly correlated with Volatile Acidity (-0.39) and Alcohol (0.48).

  2. Density has a strong positive correlation with Fixed Acidity (0.67).

  3. Volatile acidity has a positive correlation with pH (0.23).

  4. Alcohol has a negative correlation with density (-0.50).
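A correlation table like the one above can be produced with `cor()`, and the stronger pairs can be pulled out programmatically instead of by eye. The sketch below uses a small made-up data frame in place of the wine data, and the 0.3 cutoff is an assumed threshold, not a value taken from this analysis.

```r
# Toy data frame standing in for the numeric wine columns (values are made up).
toy <- data.frame(
  fixed.acidity = c(7.4, 7.8, 11.2, 7.9, 7.3, 10.5),
  pH            = c(3.51, 3.20, 3.16, 3.30, 3.39, 3.10),
  alcohol       = c(9.4, 9.8, 9.8, 9.4, 10.0, 10.5)
)

# Pearson correlation matrix of all columns.
cor_mat <- cor(toy)

# Keep each pair once (upper triangle) and select those with |r| above a cutoff.
cutoff <- 0.3  # assumed threshold for "worth a closer look"
idx <- which(upper.tri(cor_mat) & abs(cor_mat) > cutoff, arr.ind = TRUE)
strong_pairs <- data.frame(
  var1 = rownames(cor_mat)[idx[, "row"]],
  var2 = colnames(cor_mat)[idx[, "col"]],
  r    = round(cor_mat[idx], 3)
)
print(strong_pairs)
```

Using only the upper triangle avoids listing every pair twice and skips the diagonal, where each variable correlates perfectly with itself.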

These are box plots of the variables against quality.

The mean and median of fixed acidity don’t change much as quality increases, so fixed acidity appears to have little effect on quality.

Volatile acidity has a negative correlation with quality: as the volatile acidity level increases, the quality of the wine decreases.

Citric acid has a positive correlation with wine quality: as citric acid increases, wine quality increases.

Residual Sugar has no visible impact on the quality of the wine. The mean values of residual sugar are almost the same across quality levels.

Chlorides have a negative correlation with quality: as chlorides decrease, quality increases.

We noticed that low levels of Free Sulphur Dioxide are associated with poor wine and higher levels with average wine.

Good-quality wines appear to have lower densities.

Lower pH is associated with better wine, but there are a few outliers here, so we need to see how the acids affect pH.

Of the three plots of acids against pH, fixed acidity and citric acid show a negative correlation with pH, but volatile acidity shows a positive one. Higher acidity should mean lower pH, so how is that possible? Let’s investigate.

Simpson’s paradox was responsible for the trend reversal of Volatile Acid vs pH.
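To illustrate the mechanism behind Simpson’s paradox, the sketch below uses made-up numbers (not the wine data) in which the relationship is negative inside every group but positive once the groups are pooled. Which grouping variable drives the reversal in the wine data is not shown here; `group`, `x` and `y` are purely hypothetical.

```r
# Made-up data: two groups, each with a perfectly negative x-y relationship.
toy <- data.frame(
  group = rep(c("A", "B"), each = 3),
  x     = c(1, 2, 3, 11, 12, 13),
  y     = c(3, 2, 1, 13, 12, 11)
)

# Correlation ignoring the groups.
pooled <- cor(toy$x, toy$y)

# Correlation inside each group.
within_group <- sapply(split(toy, toy$group), function(d) cor(d$x, d$y))

print(pooled)        # positive
print(within_group)  # negative in both groups
```

The pooled correlation is positive only because group B sits higher on both axes than group A; the difference between the groups swamps the trend inside them.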

As Sulphates increase, quality improves.

It seems that better wines have higher Alcohol content, but there are high outliers that affect the result, so alcohol alone may not make a good-quality wine. A linear model will help make this clear.

## 
## Call:
## lm(formula = as.numeric(quality) ~ alcohol, data = wine)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8442 -0.4112 -0.1690  0.5166  2.5888 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.12503    0.17471  -0.716    0.474    
## alcohol      0.36084    0.01668  21.639   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7104 on 1597 degrees of freedom
## Multiple R-squared:  0.2267, Adjusted R-squared:  0.2263 
## F-statistic: 468.3 on 1 and 1597 DF,  p-value: < 2.2e-16

According to the R-squared value, alcohol alone explains only about 23% of the variance in wine quality (R-squared = 0.2267), so there must be other variables that affect quality.
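For a linear model with a single predictor, R-squared equals the squared Pearson correlation between the predictor and the response. The sketch below checks this against the numbers in this report (the correlation of 0.47616632 between alcohol and quality) and then on a small made-up sample; `x` and `y` are hypothetical values.

```r
# Correlation between alcohol and quality, as reported in this analysis.
r_alcohol_quality <- 0.47616632
print(round(r_alcohol_quality^2, 4))  # 0.2267, the Multiple R-squared above

# The same identity checked on made-up data with a single predictor.
x <- c(9.4, 9.8, 10.0, 10.5, 11.1, 12.0, 12.8)
y <- c(3, 3, 4, 3, 4, 5, 4)
fit <- lm(y ~ x)
r_squared <- summary(fit)$r.squared
print(all.equal(r_squared, cor(x, y)^2))
```

This is why a correlation that looks sizeable (0.48) still leaves more than three quarters of the variance in quality unexplained.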

Run a correlation test of each variable against wine quality.

##        fixed.acidity     volatile.acidity          citric.acid 
##           0.12405165          -0.39055778           0.22637251 
## log10.residual.sugar      log10.chlordies  free.sulfur.dioxide 
##           0.02353331          -0.17613996          -0.05065606 
## total.sulfur.dioxide              density                   pH 
##          -0.18510029          -0.17491923          -0.05773139 
##      log10.sulphates              alcohol 
##           0.30864193           0.47616632

These variables have the highest correlations with wine quality:

  1. Alcohol
  2. Sulphates (log10)
  3. Volatile Acidity
  4. Citric Acid
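The effect of the log10 transform can be checked with `cor.test()`, which returns the correlation estimate together with a p-value and a confidence interval. In this report the correlation of sulphates with quality moves from 0.2514 on the raw scale to 0.3086 after log10. The sketch below shows the same comparison on made-up values; `sulphates` and `quality` here are hypothetical toy vectors, not the wine data.

```r
# Made-up values standing in for a skewed predictor and the quality scores.
sulphates <- c(0.46, 0.47, 0.56, 0.57, 0.58, 0.65, 0.68, 0.80, 1.20, 2.00)
quality   <- c(5, 5, 5, 5, 6, 6, 5, 6, 7, 6)

# Pearson correlation test on the raw scale and after a log10 transform.
raw_test <- cor.test(sulphates, quality)
log_test <- cor.test(log10(sulphates), quality)

# Compare the two estimates and inspect the p-value of the transformed test.
print(c(raw = unname(raw_test$estimate), log10 = unname(log_test$estimate)))
print(log_test$p.value)
```

A log transform compresses the long right tail of a positively skewed variable, so a few extreme values have less influence on the correlation.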

Bivariate Analysis

Observations, relationships and variation between the feature of interest and other features.

  1. Fixed Acidity has almost no effect on quality.
  2. Volatile Acidity has a negative correlation with quality.
  3. Better wines have lower densities, but this may be due to their higher alcohol content.
  4. Better wines have a higher concentration of Citric Acid.
  5. Better wines seem to be more acidic.
  6. Better wines have higher alcohol percentages, but the linear model showed that alcohol explains only about 23% of the variance in quality (R-squared = 0.2267), so there may be other factors affecting the result.
  7. A lower percentage of Chlorides seems to produce better-quality wines.
  8. Residual sugar has almost no effect on wine quality.
  9. Volatile acidity has a positive correlation with pH; this was due to Simpson’s Paradox.
  10. Alcohol has the strongest effect on the quality of the wine, even though it explains only about 23% of the variance in quality.

Multivariate Plots Section

Since Alcohol was observed to have a strong effect on quality, we will investigate further and add more variables to see whether they contribute to the overall quality.

The plot shows that the correlation of density with quality was due to alcohol percentage: as shown in the plot, density doesn’t have a clear effect on quality.

Wines with higher alcohol content and higher levels of Sulphates are better wines.

A lower concentration of volatile acid and a higher concentration of alcohol produce better wines.

Low pH and a high Alcohol percentage produce better wines.

No correlation between residual sugar and quality.

There are a few high outliers for better wine with high Sulphur Dioxide, but mostly lower Sulphur Dioxide produces better wine.

Now we will investigate the effect of acids on the quality of wines.

Higher Citric Acid and lower Volatile Acid produce better wines.

No clear correlation.

No clear correlation.

Now we will create a linear model with the variables that are most strongly correlated with the quality of the wine.

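library(dplyr)   # sample_frac()
library(memisc)  # mtable()

# Split the data: 60% for training, the remaining rows for testing.
set.seed(1221)
training_data <- sample_frac(wine, .6)
test_data <- wine[ !wine$X %in% training_data$X, ]

# quality is an ordered factor, so as.numeric() yields level codes 1-6
# (quality 3-8); coefficients and predictions are on that scale.
m1 <- lm(as.numeric(quality) ~ alcohol, data = training_data)
m2 <- update(m1, ~ . + sulphates)
m3 <- update(m2, ~ . + volatile.acidity)
m4 <- update(m3, ~ . + citric.acid)
m5 <- update(m4, ~ . + fixed.acidity)
# m6 branches from m2 (alcohol + sulphates) and adds pH instead.
m6 <- update(m2, ~ . + pH)
mtable(m1,m2,m3,m4,m5,m6)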
## 
## Calls:
## m1: lm(formula = as.numeric(quality) ~ alcohol, data = training_data)
## m2: lm(formula = as.numeric(quality) ~ alcohol + sulphates, data = training_data)
## m3: lm(formula = as.numeric(quality) ~ alcohol + sulphates + volatile.acidity, 
##     data = training_data)
## m4: lm(formula = as.numeric(quality) ~ alcohol + sulphates + volatile.acidity + 
##     citric.acid, data = training_data)
## m5: lm(formula = as.numeric(quality) ~ alcohol + sulphates + volatile.acidity + 
##     citric.acid + fixed.acidity, data = training_data)
## m6: lm(formula = as.numeric(quality) ~ alcohol + sulphates + pH, 
##     data = training_data)
## 
## ====================================================================================================
##                          m1            m2           m3           m4           m5           m6       
## ----------------------------------------------------------------------------------------------------
##   (Intercept)           0.155        -0.273        0.866***     0.973***     0.497        1.494**   
##                        (0.220)       (0.224)      (0.247)      (0.254)      (0.287)      (0.515)    
##   alcohol               0.333***      0.320***     0.286***     0.284***     0.296***     0.339***  
##                        (0.021)       (0.021)      (0.020)      (0.020)      (0.020)      (0.021)    
##   sulphates                           0.855***     0.599***     0.650***     0.667***     0.733***  
##                                      (0.126)      (0.124)      (0.127)      (0.126)      (0.129)    
##   volatile.acidity                                -1.153***    -1.279***    -1.352***               
##                                                   (0.124)      (0.143)      (0.144)                 
##   citric.acid                                                  -0.231       -0.629***               
##                                                                (0.132)      (0.174)                 
##   fixed.acidity                                                              0.058***               
##                                                                             (0.017)                 
##   pH                                                                                     -0.569***  
##                                                                                          (0.149)    
## ----------------------------------------------------------------------------------------------------
##   R-squared             0.209         0.245        0.308        0.310        0.319        0.256     
##   adj. R-squared        0.208         0.243        0.306        0.307        0.315        0.254     
##   sigma                 0.707         0.691        0.662        0.661        0.657        0.686     
##   F                   252.335       155.125      141.769      107.317       89.264      109.700     
##   p                     0.000         0.000        0.000        0.000        0.000        0.000     
##   Log-likelihood    -1027.549     -1004.996     -963.139     -961.610     -955.575     -997.782     
##   Deviance            478.652       456.660      418.487      417.154      411.937      449.841     
##   AIC                2061.098      2017.992     1936.279     1935.219     1925.150     2005.565     
##   BIC                2075.695      2037.456     1960.608     1964.415     1959.211     2029.894     
##   N                   959           959          959          959          959          959         
## ====================================================================================================
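The adjusted R-squared row in the table can be reproduced from the R-squared row, the sample size N and the number of predictors p in each model, using 1 - (1 - R²)(N - 1)/(N - p - 1). A minimal sketch with values copied from the table (the helper name `adj_r2` is illustrative):

```r
# Adjusted R-squared: penalizes R-squared for the number of predictors p.
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# R-squared values and N copied from the model table; p counts the predictors.
n <- 959
table_r2 <- c(m1 = 0.209, m3 = 0.308, m5 = 0.319)
table_p  <- c(m1 = 1,     m3 = 3,     m5 = 5)

adjusted <- round(adj_r2(table_r2, n, table_p), 3)
print(adjusted)
```

With 959 observations the penalty is small, so the adjusted values stay close to the raw R-squared. The gain from m3 to m5 is modest under both measures.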

Multivariate Analysis

Observed relationships and features that strengthened each other.

  1. High Alcohol and Sulphate content produces better wines.
  2. Even though Citric Acid is only weakly correlated with quality, higher Citric Acid and lower Volatile Acid produce better wines.

Models with the dataset

Linear models were created for the dataset. From what we observed, alcohol alone explains only about 23% of the variance in wine quality, and most of the observations are concentrated on average-quality wines. This may be because the dataset consists mainly of ‘Average’ quality wines, with little data on ‘Good’ and ‘Bad’ quality wines. The resulting linear models can only be trusted to a limited degree because of their low R-squared values. It is difficult to make predictions from an incomplete dataset.


Final Plots and Summary

From what we observed, Alcohol and Sulphates have a strong effect in determining wine quality. The linear model also shows how the prediction error varies across the different wine qualities.

Plot One

Description One

The higher the alcohol percentage, the better the wine quality, so alcohol percentage has a strong effect in determining the quality of wines. Even though most of the data points are average-quality wines, the very high median in the best-quality wines means that most of those wines have a high percentage of alcohol. But alcohol is not the only factor responsible for the improvement in quality, as we saw in the linear model.

Plot Two

Description Two

From what we observed in the plot, high alcohol content and high sulphate concentrations produce better wines. The slight downward slope in the best-quality wines may be due to the percentage of alcohol being slightly greater than the concentration of Sulphates.

Plot Three

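library(ggplot2)

# Prediction error of m5 on the held-out test set (prediction minus observed);
# both sides use the 1-6 level codes of the ordered quality factor.
df <- data.frame(
  test_data$quality,
  predict(m5, test_data) - as.numeric(test_data$quality)
)
names(df) <- c("Quality", "Error")
# Jitter the points so overlapping errors within a quality level stay visible.
ggplot(data=df, aes(x=Quality,y=Error)) +
  geom_jitter(alpha = 0.3) +
  ggtitle("Linear model errors vs. expected quality")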

Description Three

The plot shows that the errors are much more densely concentrated in the ‘Average’ quality section than in the ‘Good’ and ‘Bad’ sections, which reflects the fact that most of our dataset consists of ‘Average’ quality wines. Model m5, with an R-squared of 0.319, explains around 32% of the variance in quality, and because of the lack of data these models are not well suited to predicting ‘Good’ and ‘Bad’ quality wines.
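The pattern in the plot can also be summarized numerically, for example with an overall root mean squared error and the mean error per quality level. The sketch below uses made-up observed and predicted values in place of the test set; a model fitted mostly on average wines typically over-predicts the lowest levels and under-predicts the highest ones, which is what these toy numbers are constructed to show.

```r
# Made-up observed level codes and predictions, standing in for the test set.
observed  <- c(1, 2, 3, 3, 3, 4, 4, 4, 5, 6)
predicted <- c(3.1, 3.0, 3.4, 3.2, 3.6, 3.8, 3.5, 4.1, 4.2, 4.4)

# Same sign convention as the plot: prediction minus observed value.
error <- predicted - observed

# Overall root mean squared error and mean error per observed level.
rmse <- sqrt(mean(error^2))
mean_error_by_level <- tapply(error, observed, mean)

print(round(rmse, 3))
print(round(mean_error_by_level, 3))
```

A per-level summary makes the bias at the extremes visible even when the overall error looks acceptable because the middle levels dominate the sample.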


Reflection

In conclusion, the analysis proceeded as follows. We first created plots of the different variables against quality to understand the relationships between them, and then investigated the correlations between them and wine quality. We found that the factors that most affect the quality of the wine were Alcohol percentage, Sulphate and Acid concentrations. We also found an interesting phenomenon where volatile acidity had an unexpected positive correlation with pH, and we found out that this was due to Simpson’s Paradox. We then finalized the analysis by creating multivariate plots to find combinations of variables that affect the overall wine quality.

The main struggle in this analysis was to get a higher confidence level when predicting the factors that affect the different qualities of wine, especially ‘Good’ and ‘Bad’, since the data was heavily concentrated around ‘Average’ quality. The training set is incomplete in this sense, which makes it difficult to build an accurate model. We also observed that some wines contain citric acid and others don’t. We realized that citric acid is added to some wines to increase the acidity, which is why its distribution looked almost rectangular.

For future analysis, I hope to have a more complete dataset, which would help in predicting the higher range of values and in building more accurate models.